Abstract

In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones, which are not usually present in a stacked RNN) by learning to gate these interactions.


1. Introduction 


Recurrent neural networks (RNNs) have been widely studied and used for various machine learning tasks which involve sequence modeling, especially when the input and output have variable lengths. Recent studies have revealed that RNNs using gating units can achieve promising results in both classification and generation tasks (see, e.g., Graves, 2013; Bahdanau et al., 2014; Sutskever et al., 2014).


Although RNNs can theoretically capture any long-term dependency in an input sequence, it is well known to be difficult to train an RNN to actually do so (Hochreiter, 1991; Bengio et al., 1994; Hochreiter, 1998). One of the most successful and promising approaches to solving this issue is to modify the RNN architecture, e.g., by using a gated activation function instead of the usual state-to-state transition function composing an affine transformation and a point-wise nonlinearity. A gated activation function, such as the long short-term memory (LSTM, Hochreiter & Schmidhuber, 1997) and the gated recurrent unit (GRU, Cho et al., 2014), is designed to have more persistent memory so that it can capture long-term dependencies more easily.

Sequences modeled by an RNN can contain both fast-changing and slow-changing components, and these underlying components are often structured in a hierarchical manner, which, as first pointed out by El Hihi & Bengio (1995), can help to extend the ability of the RNN to learn to model longer-term dependencies. A conventional way to encode this hierarchy in an RNN has been to stack multiple levels of recurrent layers (Schmidhuber, 1992; El Hihi & Bengio, 1995; Graves, 2013; Hermans & Schrauwen, 2013). More recently, Koutník et al. (2014) proposed a more explicit approach that partitions the hidden units in an RNN into groups such that each group receives the signal from the input and the other groups at a separate, predefined rate, which allows feedback information between these partitions to be propagated at multiple timescales. Stollenga et al. (2014) recently showed the importance of feedback information across multiple levels of feature hierarchy, albeit with feedforward neural networks.


In this paper, we propose a novel design for RNNs, called a gated-feedback RNN (GF-RNN), to deal with the issue of learning multiple adaptive timescales. The proposed RNN has multiple levels of recurrent layers, as stacked RNNs do. However, it uses gated-feedback connections from upper recurrent layers to the lower ones. This makes the hidden states across a pair of consecutive timesteps fully connected. To encourage each recurrent layer to work at a different timescale, the proposed GF-RNN controls the strength of the temporal (recurrent) connection adaptively. This effectively lets the model adapt its structure based on the input sequence.

We empirically evaluated the proposed model against the conventional stacked RNN and the usual single-layer RNN on the tasks of language modeling and Python program evaluation (Zaremba & Sutskever, 2014). Our experiments reveal that the proposed model significantly outperforms the conventional approaches on two different datasets.

2. Recurrent Neural Network 

An RNN is able to process a sequence of arbitrary length by recursively applying a transition function to its internal hidden states for each symbol of the input sequence. The activation of the hidden states at timestep $t$ is computed as a function $f$ of the current input symbol $x_t$ and the previous hidden states $h_{t-1}$:

$$h_t = f(x_t, h_{t-1}).$$

It is common to use the state-to-state transition function $f$ as the composition of an element-wise nonlinearity with an affine transformation of both $x_t$ and $h_{t-1}$:

$$h_t = \phi(W x_t + U h_{t-1}), \tag{1}$$

where $W$ is the input-to-hidden weight matrix, $U$ is the state-to-state recurrent weight matrix, and $\phi$ is usually a logistic sigmoid function or a hyperbolic tangent function.
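As an illustration of Eq. (1), the following minimal sketch applies the transition recursively over a sequence, assuming a hyperbolic tangent nonlinearity. The helper names `matvec`, `rnn_step` and `run_rnn`, and the use of plain Python lists for vectors and matrices, are choices made for this example only.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(W, U, x_t, h_prev):
    """One application of Eq. (1) with phi = tanh: h_t = tanh(W x_t + U h_{t-1})."""
    pre = [a + b for a, b in zip(matvec(W, x_t), matvec(U, h_prev))]
    return [math.tanh(p) for p in pre]

def run_rnn(W, U, xs, h0):
    """Apply the transition to every input vector in xs, starting from h0."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(W, U, x_t, h)
        states.append(h)
    return states
```

Because the same `W` and `U` are reused at every timestep, `run_rnn` accepts sequences of any length, which is the property the paragraph above relies on.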

We can factorize the probability of a sequence of arbitrary length into

$$p(x_1, \cdots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \cdots, x_{T-1}).$$

Then, we can train an RNN to model this distribution by letting it predict the probability of the next symbol $x_{t+1}$ given the hidden states $h_t$, which are a function of all the previous symbols $x_1, \cdots, x_{t-1}$ and the current symbol $x_t$:

$$p(x_{t+1} \mid x_1, \cdots, x_t) = g(h_t).$$

This approach of using a neural network to model a probability distribution over sequences is widely used, for instance, in language modeling (see, e.g., Bengio et al., 2001; Mikolov, 2012).
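The chain-rule factorization above means that the log-probability of a sequence is a sum of per-step conditional log-probabilities. The sketch below shows this with a toy model. Both `next_symbol_prob` (a stand-in for the output function $g(h_t)$) and `uniform_model` are hypothetical names introduced only for this example.

```python
import math

def sequence_log_prob(sequence, next_symbol_prob):
    """Sum log p(x_t | x_1, ..., x_{t-1}) over all positions (chain rule).

    next_symbol_prob(prefix, symbol) is a hypothetical callable returning
    the model's probability of `symbol` given the preceding `prefix`.
    """
    total = 0.0
    for t, symbol in enumerate(sequence):
        total += math.log(next_symbol_prob(sequence[:t], symbol))
    return total

def uniform_model(prefix, symbol, vocab_size=4):
    """Toy predictor that ignores the prefix and spreads mass uniformly."""
    return 1.0 / vocab_size
```

In an RNN language model the prefix would be summarized by the hidden state rather than passed explicitly; the explicit prefix is used here only to keep the example self-contained.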


2.1. Gated Recurrent Neural Network 


The difficulty of training an RNN to capture long-term dependencies has long been known (Hochreiter, 1991; Bengio et al., 1994; Hochreiter, 1998). One previously successful approach to this fundamental challenge has been to modify the state-to-state transition function to encourage some hidden units to adaptively maintain long-term memory, creating paths in the time-unfolded RNN such that gradients can flow over many timesteps.

Long short-term memory (LSTM) was proposed by Hochreiter & Schmidhuber (1997) to specifically address this issue of learning long-term dependencies. The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary. More recently, Cho et al. (2014) proposed a gated recurrent unit (GRU) which adaptively remembers and forgets its state based on the input signal to the unit. Both of these units are central to our proposed model, and we will describe them in more detail in the remainder of this section.


2.1.1. Long Short-Term Memory


Since the initial 1997 proposal, several variants of the LSTM have been introduced (Gers et al., 2000; Zaremba et al., 2014). Here we follow the implementation provided by Zaremba et al. (2014).

Such an LSTM unit consists of a memory cell $c_t$, an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. The memory cell carries the memory content of an LSTM unit, while the gates control the amount of changes to and exposure of the memory content. The content of the memory cell $c_t^j$ of the $j$-th LSTM unit at timestep $t$ is updated similar to the form of a gated leaky neuron, i.e., as the weighted sum of the new content $\tilde{c}_t^j$ and the previous memory content $c_{t-1}^j$ modulated by the input and forget gates, $i_t^j$ and $f_t^j$, respectively:

$$c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j, \tag{2}$$

where

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1}). \tag{3}$$

The input and forget gates control how much new content should be memorized and how much old content should be forgotten, respectively. These gates are computed from the previous hidden states and the current input:

$$i_t = \sigma(W_i x_t + U_i h_{t-1}), \tag{4}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1}), \tag{5}$$

where $i_t = [i_t^k]_{k=1}^{p}$ and $f_t = [f_t^k]_{k=1}^{p}$ are respectively the vectors of the input and forget gates in a recurrent layer composed of $p$ LSTM units, and $\sigma(\cdot)$ is an element-wise logistic sigmoid function. $x_t$ and $h_{t-1}$ are the input vector and previous hidden states of the LSTM units, respectively.

Once the memory content of the LSTM unit is updated, the hidden state $h_t^j$ of the $j$-th LSTM unit is computed as:

$$h_t^j = o_t^j \tanh(c_t^j).$$

The output gate $o_t^j$ controls to which degree the memory content is exposed. Similarly to the other gates, the output gate also depends on the current input and the previous hidden states such that

$$o_t = \sigma(W_o x_t + U_o h_{t-1}). \tag{6}$$

In other words, these gates and the memory cell allow an LSTM unit to adaptively forget, memorize and expose the memory content. If the detected feature, i.e., the memory content, is deemed important, the forget gate will be closed and carry the memory content across many timesteps, which is equivalent to capturing a long-term dependency. On the other hand, the unit may decide to reset the memory content by opening the forget gate. Since these two modes of operation can happen simultaneously across different LSTM units, an RNN with multiple LSTM units may capture both fast-moving and slow-moving components.
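To make the interplay of the gates concrete, here is a minimal sketch of one LSTM step following Eqs. (2) to (6). It is not the implementation of Zaremba et al. (2014). The `params` dictionary layout and the helper names are assumptions made for this example, and vectors and matrices are plain Python lists.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, U, x, h):
    """Compute W x + U h for matrices given as lists of rows."""
    return [sum(w * xi for w, xi in zip(w_row, x)) +
            sum(u * hi for u, hi in zip(u_row, h))
            for w_row, u_row in zip(W, U)]

def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM step; params maps 'c', 'i', 'f', 'o' to (W, U) pairs (hypothetical layout)."""
    c_tilde = [math.tanh(a) for a in affine(*params["c"], x_t, h_prev)]  # Eq. (3)
    i_t = [sigmoid(a) for a in affine(*params["i"], x_t, h_prev)]        # Eq. (4)
    f_t = [sigmoid(a) for a in affine(*params["f"], x_t, h_prev)]        # Eq. (5)
    o_t = [sigmoid(a) for a in affine(*params["o"], x_t, h_prev)]        # Eq. (6)
    # Eq. (2): gated sum of the previous and the new memory content.
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    # Hidden state: memory content exposed through the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Note that in Eq. (2) a forget-gate value near 1 retains the previous memory content, while a value near 0 discards it.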


2.1.2. Gated Recurrent Unit 


The GRU was recently proposed by Cho et al. (2014). Like the LSTM, it was designed to adaptively reset or update its memory content. Each GRU thus has a reset gate $r_t^j$ and an update gate $z_t^j$, which are reminiscent of the forget and input gates of the LSTM. However, unlike the LSTM, the GRU fully exposes its memory content at each timestep and balances between the previous memory content and the new memory content strictly using leaky integration, albeit with an adaptive time constant controlled by the update gate $z_t^j$.

At timestep $t$, the state $h_t^j$ of the $j$-th GRU is computed by

$$h_t^j = (1 - z_t^j)\, h_{t-1}^j + z_t^j\, \tilde{h}_t^j, \tag{7}$$

where $h_{t-1}^j$ and $\tilde{h}_t^j$ respectively correspond to the previous memory content and the new candidate memory content. The update gate $z_t^j$ controls how much of the previous memory content is to be forgotten and how much of the new memory content is to be added. The update gate is computed based on the previous hidden states $h_{t-1}$ and the current input $x_t$:

$$z_t = \sigma(W_z x_t + U_z h_{t-1}). \tag{8}$$

The new memory content $\tilde{h}_t^j$ is computed similarly to the conventional transition function in Eq. (1):

$$\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1}), \tag{9}$$

where $\odot$ is an element-wise multiplication.

One major difference from the traditional transition function (Eq. (1)) is that the states of the previous step $h_{t-1}$ are modulated by the reset gates $r_t$. This behavior allows a GRU to ignore the previous hidden states whenever it is deemed necessary considering the previous hidden states and the current input:

$$r_t = \sigma(W_r x_t + U_r h_{t-1}). \tag{10}$$

The update mechanism helps the GRU to capture long-term dependencies. Whenever a previously detected feature, or the memory content, is considered to be important for later use, the update gate will be closed to carry the current memory content across multiple timesteps. The reset mechanism helps the GRU to use the model capacity efficiently by allowing it to reset whenever the detected feature is not necessary anymore.


3. Gated Feedback Recurrent Neural Network


Although capturing long-term dependencies in a sequence is an important and difficult goal of RNNs, it is worthwhile to notice that a sequence often consists of both slow-moving and fast-moving components, of which only the former corresponds to long-term dependencies. Ideally, an RNN needs to capture both long-term and short-term dependencies.

El Hihi & Bengio (1995) first showed that an RNN can capture these dependencies of different timescales more easily and efficiently when the hidden units of the RNN are explicitly partitioned into groups that correspond to different timescales. The clockwork RNN (CW-RNN) (Koutník et al., 2014) implemented this by allowing the $i$-th module to operate at the rate of $2^{i-1}$, where $i$ is a positive integer, meaning that the module is updated only when $t \bmod 2^{i-1} = 0$. This makes each module operate at a different rate. In addition, they precisely defined the connectivity pattern between modules by allowing the $i$-th module to be affected by the $j$-th module when $j > i$.

Here, we propose to generalize the CW-RNN by allowing the model to adaptively adjust the connectivity pattern between the hidden layers in consecutive timesteps. Similar to the CW-RNN, we partition the hidden units into multiple modules in which each module corresponds to a different layer in a stack of recurrent layers.

Unlike the CW-RNN, however, we do not set an explicit rate for each module. Instead, we let each module operate at different timescales by hierarchically stacking them. Each module is fully connected to all the other modules across the stack and to itself. In other words, we do not define the connectivity pattern across a pair of consecutive timesteps. This is contrary to the design of the CW-RNN and the conventional stacked RNN. The recurrent connection between two modules is instead gated by a logistic unit (with values in $[0, 1]$), which is computed based on the current input and the previous states of the hidden layers. We call this gating unit a global reset gate, as opposed to a unit-wise reset gate which applies only to a single unit (see Eqs. (2) and (9)).

Figure 1. Illustrations of (a) conventional stacking approach and (b) gated-feedback approach to form a deep RNN architecture. Bullets in (b) correspond to global reset gates. Skip connections are omitted to simplify the visualization of networks.
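For reference, the unit-wise gating of a single GRU layer, against which the global reset gate is contrasted, can be sketched as below. The sketch assumes the standard GRU form $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ with $z_t = \sigma(W_z x_t + U_z h_{t-1})$, $r_t = \sigma(W_r x_t + U_r h_{t-1})$ and $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$. The dictionary keys and helper names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(p, x_t, h_prev):
    """One GRU step; p holds the matrices 'W', 'U', 'Wz', 'Uz', 'Wr', 'Ur' (hypothetical keys)."""
    # Update gate z_t and reset gate r_t, one value per unit.
    z_t = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    r_t = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    # Candidate content: the recurrent term is scaled element-wise by r_t.
    h_tilde = [math.tanh(a + r * b)
               for a, r, b in zip(matvec(p["W"], x_t), r_t, matvec(p["U"], h_prev))]
    # Leaky integration between the previous state and the candidate.
    return [(1.0 - z) * h + z * ht for z, h, ht in zip(z_t, h_prev, h_tilde)]
```

Each entry of `r_t` gates one unit's recurrent input, whereas a global reset gate is a single scalar shared by a whole layer-to-layer connection.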


The global reset gate is computed as:

$$g^{i \to j} = \sigma\left(w_g^{i \to j} h_t^{j-1} + u_g^{i \to j} h_{t-1}^{*}\right),$$

where $h_{t-1}^{*}$ is the concatenation of all the hidden states from the previous timestep $t-1$. The superscript $i \to j$ is an index of the associated set of parameters for the transition from layer $i$ in timestep $t-1$ to layer $j$ in timestep $t$. $w_g^{i \to j}$ and $u_g^{i \to j}$ are respectively the weight vectors for the current input and the previous hidden states. When $j = 1$, $h_t^{j-1}$ is $x_t$.

In other words, the signal from $h_{t-1}^i$ to $h_t^j$ is controlled by a single scalar $g^{i \to j}$ which depends on the input $x_t$ and all the previous hidden states $h_{t-1}^{*}$.
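The following sketch shows how such scalar gates could be computed and used, under two stated assumptions: the gate is $g^{i \to j} = \sigma(w_g^{i \to j} h_t^{j-1} + u_g^{i \to j} h_{t-1}^{*})$, and the layer uses the tanh-unit update of Section 3.1, $h_t^j = \tanh(W^{j-1 \to j} h_t^{j-1} + \sum_i g^{i \to j} U^{i \to j} h_{t-1}^i)$. All function and argument names are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def global_reset_gate(w_g, u_g, below, h_star_prev):
    """Scalar gate for one (i -> j) connection: sigmoid(w_g . below + u_g . h*_{t-1})."""
    return sigmoid(dot(w_g, below) + dot(u_g, h_star_prev))

def gf_tanh_layer(W, U_list, w_g_list, u_g_list, below, h_prev_layers):
    """Hidden state of layer j at time t for tanh units.

    below         : h_t^{j-1} (or x_t for the first layer)
    h_prev_layers : previous-step hidden states of all L layers
    U_list, w_g_list, u_g_list : one entry per source layer i
    """
    # h*_{t-1}: concatenation of all previous-step hidden states.
    h_star_prev = [v for h in h_prev_layers for v in h]
    total = matvec(W, below)
    for U, w_g, u_g, h_i in zip(U_list, w_g_list, u_g_list, h_prev_layers):
        g = global_reset_gate(w_g, u_g, below, h_star_prev)
        # The whole recurrent signal from layer i is scaled by one scalar.
        total = [t + g * r for t, r in zip(total, matvec(U, h_i))]
    return [math.tanh(t) for t in total]
```

Fixing every gate to 1 recovers an ungated, fully connected recurrent transition, which corresponds to the ablation reported in the language modeling results.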

We call this RNN with fully-connected recurrent transitions and global reset gates a gated-feedback RNN (GF-RNN). Fig. 1 illustrates the difference between the conventional stacked RNN and our proposed GF-RNN. In both models, information flows from lower recurrent layers to upper recurrent layers. The GF-RNN, however, further allows information from the upper recurrent layers, corresponding to coarser timescales, to flow back into the lower recurrent layers, corresponding to finer timescales.

In the remainder of this section, we describe how to use the previously described LSTM unit, GRU, and more traditional tanh unit in the GF-RNN.


3.1. Practical Implementation of GF-RNN


tanh Unit. For a stacked tanh-RNN, the signal from the previous timestep is gated. The hidden state of the $j$-th layer is computed by

$$h_t^j = \tanh\left(W^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j} U^{i \to j} h_{t-1}^i\right),$$

where $L$ is the number of hidden layers, and $W^{j-1 \to j}$ and $U^{i \to j}$ are the weight matrices of the current input and the previous hidden states of the $i$-th module, respectively. Compared to Eq. (1), the only difference is that the previous hidden states are from multiple layers and controlled by the global reset gates.

Long Short-Term Memory and Gated Recurrent Unit. In the cases of LSTM and GRU, we do not use the global reset gates when computing the unit-wise gates. In other words, Eqs. (4)-(6) for LSTM, and Eqs. (8) and (10) for GRU are not modified. We only use the global reset gates when computing the new state (see Eq. (3) for LSTM, and Eq. (9) for GRU).

The new memory content of an LSTM at the $j$-th layer is computed by

$$\tilde{c}_t^j = \tanh\left(W_c^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j} U_c^{i \to j} h_{t-1}^i\right).$$

In the case of a GRU, similarly,

$$\tilde{h}_t^j = \tanh\left(W^{j-1 \to j} h_t^{j-1} + r_t^j \odot \sum_{i=1}^{L} g^{i \to j} U^{i \to j} h_{t-1}^i\right).$$


4. Experiment Settings


4.1. Tasks


We evaluated the proposed GF-RNN on character-level language modeling and Python program evaluation. Both tasks are representative examples of discrete sequence modeling, where a model is trained to minimize the negative log-likelihood of training sequences:

$$\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} -\log p\left(x_t^n \mid x_1^n, \ldots, x_{t-1}^n; \theta\right),$$

where $\theta$ is a set of model parameters.


4.1.1. Language Modeling 

We used the dataset made available as a part of the human knowledge compression contest (Hutter, 2012). We refer to this dataset as the Hutter dataset. The dataset, which was built from English Wikipedia, contains 100 MBytes of characters which include Latin alphabets, non-Latin alphabets, XML markups and special characters. Closely following the protocols in (Mikolov et al., 2012; Graves, 2013), we used the first 90 MBytes of characters to train a model, the next 5 MBytes as a validation set, and the remaining as a test set, with the vocabulary of 205 characters including a token for an unknown character. We used the average number of bits-per-character (BPC, $E[-\log_2 P(x_{t+1} \mid h_t)]$) to measure the performance of each model on the Hutter dataset.
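As a small worked example of the BPC metric, the sketch below averages $-\log_2$ of the probabilities that a model assigned to the characters that actually came next. The function name and the input format (a list of such probabilities) are assumptions for this illustration.

```python
import math

def bits_per_character(correct_char_probs):
    """Average of -log2 p over the probabilities given to the true next characters."""
    total_bits = -sum(math.log2(p) for p in correct_char_probs)
    return total_bits / len(correct_char_probs)
```

For instance, a model that gives the true next character probability 0.5 at one position and 0.25 at another spends 1 bit and 2 bits respectively, for a BPC of 1.5. Lower values indicate a better model.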


4.1.2. Python Program Evaluation


Zaremba & Sutskever (2014) recently showed that an RNN, more specifically a stacked LSTM, is able to execute a short Python script. Here, we compared the proposed architecture against the conventional stacking approach on this task, to which we refer as Python program evaluation.

Scripts used in this task include addition, multiplication, subtraction, for-loop, variable assignment, logical comparison and if-else statement. The goal is to generate, or predict, a correct return value of a given Python script. The input is a program while the output is the result of a print statement: every input script ends with a print statement. Both the input script and the output are sequences of characters, where the input and output vocabularies respectively consist of 41 and 13 symbols.

The advantage of evaluating the models with this task is that we can artificially control the difficulty of each sample (input-output pair). The difficulty is determined by the number of nesting levels in the input sequence and the length of the target sequence. We can do a finer-grained analysis of each model by observing its behavior on examples of different difficulty levels.

In Python program evaluation, we closely follow (Zaremba & Sutskever, 2014) and compute the test accuracy as the next step symbol prediction given a sequence of correct preceding symbols.


4.2. Models

We compared three different RNN architectures: a single-layer RNN, a stacked RNN and the proposed GF-RNN. For each architecture, we evaluated three different transition functions: an affine transformation followed by tanh, long short-term memory (LSTM) and gated recurrent unit (GRU). For fair comparison, we constrained the number of parameters of each model to be roughly similar to each other.

For each task, in addition to these capacity-controlled experiments, we conducted a few extra experiments to further test and better understand the properties of the GF-RNN.


Table 1. The sizes of the models used in character-level language modeling. Gated Feedback L is a GF-RNN with the same number of hidden units as a Stacked RNN (but more parameters). The number of units is shown as (number of hidden layers) × (number of hidden units per layer).

Unit | Architecture     | # of Units
tanh | Single           | 1 × 1000
tanh | Stacked          | 3 × 390
tanh | Gated Feedback   | 3 × 303
GRU  | Single           | 1 × 540
GRU  | Stacked          | 3 × 228
GRU  | Gated Feedback   | 3 × 165
GRU  | Gated Feedback L | 3 × 228
LSTM | Single           | 1 × 456
LSTM | Stacked          | 3 × 191
LSTM | Gated Feedback   | 3 × 140
LSTM | Gated Feedback L | 3 × 191


4.2.1. Language Modeling


For the task of character-level language modeling, we constrained the number of parameters of each model to correspond to that of a single-layer RNN with 1000 tanh units (see Table 1 for more details). Each model is trained for at most 100 epochs.

We used RMSProp (Hinton, 2012) and momentum to tune the model parameters (Graves, 2013). According to the preliminary experiments and their results on the validation set, we used a learning rate of 0.001 and momentum coefficient of 0.9 when training the models having either GRU or LSTM units. It was necessary to choose a much smaller learning rate of $5 \times 10^{-5}$ in the case of tanh units to ensure the stability of learning. Whenever the norm of the gradient explodes, we halve the learning rate.


Figure 2. Validation learning curves of three different RNN architectures: stacked RNN, GF-RNN with the same number of model parameters, and GF-RNN with the same number of hidden units. The curves represent training up to 100 epochs. Best viewed in colors.

Each update is done using a minibatch of 100 subsequences of length 100 each, to avoid memory overflow problems when unfolding in time for backprop. We approximate full back-propagation by carrying the hidden states computed at the previous update to initialize the hidden units in the next update. After every 100-th update, the hidden states were reset to all zeros.
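The state-carrying scheme described above can be sketched as a simple loop. Here `update_fn` and `zero_state` are hypothetical callables standing in for one parameter update on a minibatch (returning the final hidden states) and for the all-zero initial state. The actual training code is not given in the text.

```python
def run_updates(batches, update_fn, zero_state, reset_every=100):
    """Carry hidden states across updates, resetting them periodically.

    update_fn(batch, state) -> new_state is assumed to perform one update
    with back-propagation truncated to the current subsequences.
    """
    state = zero_state()
    for n, batch in enumerate(batches, start=1):
        state = update_fn(batch, state)   # next update starts from these states
        if n % reset_every == 0:
            state = zero_state()          # reset after every 100-th update
    return state
```

Carrying the state lets each length-100 subsequence see context from earlier text even though gradients do not flow across the minibatch boundary.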


Table 2. Test set BPC (lower is better) of models trained on the Hutter dataset for 100 epochs. (*) The gated-feedback RNN with the global reset gates fixed to 1 (see Sec. 5.1 for details). Bold indicates statistically significant winner over the column (same type of units, different overall architecture).

Architecture     | tanh  | GRU   | LSTM
Single-layer     | 1.937 | 1.883 | 1.887
Stacked          | 1.892 | 1.871 | 1.868
Gated Feedback   | 1.949 | 1.855 | 1.842
Gated Feedback L | -     | 1.813 | 1.789
Feedback*        | -     | -     | 1.854


4.2.2. Python Program Evaluation 


For the task of Python program evaluation, we used an RNN encoder-decoder based approach to learn the mapping from Python scripts to the corresponding outputs, as done by Cho et al. (2014) and Sutskever et al. (2014) for machine translation. When training the models, Python scripts are fed into the encoder RNN, and the hidden state of the encoder RNN is unfolded for 50 timesteps. Prediction is performed by the decoder RNN whose initial hidden state is initialized with the last hidden state of the encoder RNN. The first hidden state of the encoder RNN $h_0$ is always initialized to a zero vector.


For this task, we used GRU and LSTM units either with or without the gated-feedback connections. Each encoder or decoder RNN has three hidden layers. For GRU, each hidden layer contains 230 units, and for LSTM each hidden layer contains 200 units.


Following Zaremba & Sutskever (2014), we used the mixed curriculum strategy for training each model, where each training example has a random difficulty sampled uniformly. We generated 320,000 examples using the script provided by Zaremba & Sutskever (2014), with the nesting randomly sampled from $[1, 5]$ and the target length from $[1, 10^{10}]$.


We used Adam (Kingma & Ba, 2014) to train our models, and each update used a minibatch of 128 sequences. We used a learning rate of 0.001, and $\beta_1$ and $\beta_2$ were both set to 0.99. We trained each model for 30 epochs, with early stopping based on the validation set performance to prevent over-fitting.


At test time, we evaluated each model on multiple sets of test examples where each set is generated using a fixed target length and number of nesting levels. Each test set contains 2,000 examples which are ensured not to overlap with the training set.


5. Results and Analysis 
5.1. Language Modeling 

It is clear from Table 2 that the proposed gated-feedback architecture outperforms the other baseline architectures that we have tried when used together with widely used gated units such as LSTM and GRU. However, the proposed architecture failed to improve the performance of a vanilla RNN with tanh units. In addition to the final modeling performance, in Fig. 2 we plotted the learning curves of some models against wall-clock time (measured in seconds). RNNs that are trained with the proposed gated-feedback architecture tend to make much faster progress over time. This behavior is observed both when the number of parameters is constrained and when the number of hidden units is constrained. This suggests that the proposed GF-RNN significantly facilitates optimization/learning.

Effect of Global Reset Gates

After observing the superiority of the proposed gated-feedback architecture over the single-layer or conventional stacked ones, we further trained another GF-RNN with LSTM units, but this time with the global reset gates fixed to 1, to validate the need for the global reset gates. Without the global reset gates, feedback signals from the upper recurrent layers influence the lower recurrent layers fully without any control. The test set BPC of GF-LSTM without global reset gates was 1.854, which is in between the results of the conventional stacked LSTM and the GF-LSTM with global reset gates (see the last row of Table 2). This confirms the importance of adaptively gating the feedback connections.

Qualitative Analysis: Text Generation

Here we qualitatively evaluate the stacked LSTM and GF-LSTM trained earlier by generating text. We choose a subsequence of characters from the test set and use it as an initial seed. Once the model finishes reading the seed text, we let the model generate the following characters by sampling a symbol from the softmax probabilities of a timestep and then providing the symbol as the next input.

Given two seed snippets selected randomly from the test set, we generated the sequence of characters ten times for each model (stacked LSTM and GF-LSTM). We show one of those ten generated samples per model and per seed snippet in Table 3. We observe that the stacked LSTM failed to close the tags with </username> and </contributor> in both trials. However, the GF-LSTM succeeded in closing both of them, which shows that it learned about the structure of XML tags. This type of behavior could be seen throughout all ten random generations.


Table 3. Generated texts with our trained models. Given the seed at the left-most column (bold-faced font), the models predict next 200 ~ 300 characters. Tabs, spaces and new-line characters are also generated by the models.

Seed                              | Stacked LSTM                               | GF-LSTM
[[pl:Icon]]                       | <revision>                                 | <revision>
[[pt:Icon]]                       | <id>15908383</id>                          | <id>41968413</id>
[[ru:Icon]]                       | <timestamp>                                | <timestamp>
[[sv:Programspråket Icon]]</text> | 2002-07-20T18:33:34Z                       | 2006-09-03T11:38:06Z
</revision>                       | </timestamp>                               | </timestamp>
</page>                           | <contributor>                              | <contributor>
<page>                            | <username>The Courseichi</userrand         | <username>Navisb</username>
<title>Iconology</title>          | vehicles in [[enguit]].                    | <id>46264</id>
<id>14802</id>                    | ==The inhibitors and alphabetsy and moral/ | </contributor>
<revi                             | hande in===In four [[communications]] and  | <comment>The increase from the time
----------------------------------+--------------------------------------------+--------------------------------------------
<title>Inherence relation</title> | <username>Robert]]                         | <username>Roma</username>
<id>14807</id>                    | [[su:20 aves]]                             | <id>48</id>
<revision>                        | [[vi:10 Februari]]                         | </contributor>
<id>34980694</id>                 | [[bi:16 agostoferosin]]                    | <comment>Vly''' and when one hand
<timestamp>                       | [[pt:Darenetische]]                        | is angels and [[ghost]] borted and
2006-01-13T04:19:25Z              | [[eo:Hebrew selsowen]]                     | ''mask r:centrions]], [[Afghanistan]],
</timestamp>                      | [[hr:2 febber]]                            | [[Glencoddic tetrahedron]], [[Adjudan]],
<contributor>                     | [[io:21 februari]]                         | [[Dghacn]], for example, in which materials
<username>Ro                      | [[it: 18 de februari]]                     | dangerous (carriers) can only use with one

Table 4. Test set BPC of neural language models trained on the Hutter dataset. MRNN = multiplicative RNN results from Sutskever et al. (2011), and Stacked LSTM results from Graves (2013).

MRNN | Stacked LSTM | GF-LSTM
1.60 | 1.67         | 1.58


Large GF-RNN 


We trained a larger GF-RNN that has five recurrent layers, each of which has 700 LSTM units. This makes it possible for us to compare the performance of the proposed architecture against the previously reported results using other types of RNNs. In Table 4, we present the test set BPC of a multiplicative RNN (Sutskever et al., 2011), a stacked LSTM (Graves, 2013) and the GF-RNN with LSTM units. The performance of the proposed GF-RNN is comparable to, or better than, the previously reported best results. Note that Sutskever et al. (2011) used a vocabulary of 86 characters (with XML tags and Wikipedia markup removed), so their result is not directly comparable with ours. In this experiment, we used Adam instead of RMSProp to optimize the RNN. We used a learning rate of 0.001, and $\beta_1$ and $\beta_2$ were set to 0.9 and 0.99, respectively.


5.2. Python Program Evaluation 

Fig. 3 presents the test results of each model represented in heatmaps. The accuracy tends to decrease as the length of target sequences or the number of nesting levels grows, i.e., as the difficulty or complexity of the Python program increases. We observed that in most of the test sets, GF-RNNs outperform stacked RNNs, regardless of the type of units. Fig. 3 (c) represents the gaps between the test accuracies of stacked RNNs and GF-RNNs, which are computed by subtracting (a) from (b). In Fig. 3 (c), the red and yellow colors, indicating large gains, are concentrated in the top or right regions (where either the number of nesting levels or the length of target sequences increases). From this we can more easily see that the GF-RNN outperforms the stacked RNN, especially as the number of nesting levels grows or the length of target sequences increases.

Figure 3. Heatmaps of (a) Stacked RNN, (b) GF-RNN, and (c) difference obtained by subtracting (a) from (b). The top row shows the heatmaps of models using GRUs, and the bottom row shows the heatmaps of the models using LSTM units. Best viewed in colors.

6. Conclusion 


We proposed a novel architecture for deep stacked RNNs which uses gated-feedback connections between different layers. Our experiments focused on challenging sequence modeling tasks of character-level language modeling and Python program evaluation. The results were consistent over different datasets, and clearly demonstrated that the gated-feedback architecture is helpful when the models are trained on complicated sequences that involve long-term dependencies. We also showed that the gated-feedback architecture was faster in wall-clock time over the training and achieved better performance compared to a standard stacked RNN with the same amount of capacity. The large GF-LSTM was able to outperform the previously reported best results on character-level language modeling. This suggests that GF-RNNs are also scalable. GF-RNNs were able to outperform standard stacked RNNs and the best previous records on the Python program evaluation task with varying difficulties.

We noticed a deterioration in performance when the proposed gated-feedback architecture was used together with a tanh activation function, unlike when it was used with more sophisticated gated activation functions. A more thorough investigation into the interaction between the gated-feedback connections and the role of the recurrent activation function is required in the future.